scene knowledge
Advancing Surgical VQA with Scene Graph Knowledge
Yuan, Kun, Kattel, Manasi, Lavanchy, Joel L., Navab, Nassir, Srivastav, Vinkle, Padoy, Nicolas
The modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with language capabilities is emerging as a necessity. Our work aims to advance Visual Question Answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. First, we propose a Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. Compared to existing surgical VQA datasets, our SSG-QA dataset is more complex, diverse, geometrically grounded, unbiased, and surgical action-oriented. We then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM), which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Our comprehensive analysis of the SSG-QA dataset shows that SSG-QA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation of current surgical VQA systems is the lack of scene knowledge to answer complex queries. We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-QA
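The step in which scene graphs are fed into a question engine can be illustrated with a minimal sketch. It is not the SSG-QA engine: the triple format, the labels, the `ACTIONS` set, the question templates, and the `generate_qa` function are all assumptions made for illustration.

```python
# Hypothetical scene graph for one frame: (subject, relation, object) triples.
# Labels are illustrative and are not taken from the SSG-QA dataset.
SCENE_GRAPH = [
    ("grasper", "retract", "gallbladder"),
    ("hook", "dissect", "cystic_duct"),
    ("grasper", "left_of", "hook"),
]

# Assumed split between action relations and spatial relations.
ACTIONS = {"retract", "dissect"}


def readable(label):
    """Turn an identifier such as 'cystic_duct' into 'cystic duct'."""
    return label.replace("_", " ")


def generate_qa(triples):
    """Turn scene-graph triples into (question, answer) pairs via templates."""
    qa = []
    for subj, rel, obj in triples:
        if rel in ACTIONS:
            # Action edge: ask about the action and about the acting instrument.
            qa.append((f"What is the {readable(subj)} doing?", rel))
            qa.append(
                (f"Which instrument is used to {rel} the {readable(obj)}?", subj)
            )
        else:
            # Spatial edge: ask which node stands in this relation to the object.
            qa.append((f"What is {readable(rel)} the {readable(obj)}?", subj))
    # Counting question over the whole graph.
    acting = {s for s, r, _ in triples if r in ACTIONS}
    qa.append(("How many instruments are acting?", str(len(acting))))
    return qa


if __name__ == "__main__":
    for question, answer in generate_qa(SCENE_GRAPH):
        print(f"{question} -> {answer}")
```

Because every answer is read directly from the graph, each QA pair in this sketch is grounded in the detected instruments, anatomies, and relations rather than in free-form annotation.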
Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI
Yaoxian, Song, Penglei, Sun, Haoyu, Liu, Zhixu, Li, Wei, Song, Yanghua, Xiao, Xiaofang, Zhou
Embodied AI is one of the most popular research areas in artificial intelligence and robotics, and it can effectively improve the intelligence of real-world agents (i.e., robots) serving human beings. Scene knowledge is important for an agent to understand its surroundings and make correct decisions in the varied open world. Currently, a knowledge base for embodied tasks is missing, and most existing work uses general knowledge bases or pre-trained models to enhance the intelligence of an agent. Conventional knowledge bases are sparse, insufficient in capacity, and costly in data collection. Pre-trained models suffer from uncertain knowledge and are hard to maintain. To overcome these challenges of scene knowledge, we propose a scene-driven multimodal knowledge graph (Scene-MMKG) construction method combining conventional knowledge engineering and large language models. A unified scene knowledge injection framework is introduced for knowledge representation. To evaluate the advantages of our proposed method, we instantiate Scene-MMKG for typical indoor robotic functionalities (Manipulation and Mobility), named ManipMob-MMKG. Comparisons of characteristics indicate that our instantiated ManipMob-MMKG is broadly superior in data-collection efficiency and knowledge quality. Experimental results on typical embodied tasks show that knowledge-enhanced methods using ManipMob-MMKG can clearly improve performance without complex re-design of model structures. Our project can be found at https://sites.google.com/view/manipmob-mmkg
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Chen, Zhihong, Zhang, Ruifei, Song, Yibing, Wan, Xiang, Li, Guanbin
Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study~\cite{luo2022goes}, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at https://github.com/zhjohnchan/SK-VG.
Integration of knowledge to support automatic object reconstruction from images and 3D data
Boochs, Frank, Marbs, Andreas, Truong, Hung, Hmida, Helmi Ben, Karmacharya, Ashish, Cruz, Christophe, Habed, Adlane, Voisin, Yvon, Nicolle, Christophe
Object reconstruction is an important task in many fields of application, as it allows the generation of digital representations of our physical world that serve as a basis for analysis, planning, construction, visualization, or other aims. A reconstruction is normally based on reliable data (images or 3D point clouds, for example) capturing the object in its complete extent. This data then has to be compiled and analyzed in order to extract all necessary geometrical elements, which represent the object and form a digital copy of it. Traditional strategies are largely based on manual interaction and interpretation, because with increasing object complexity human understanding is indispensable for achieving acceptable and reliable results. But human interaction is time-consuming and expensive, which is why much research has already been invested in algorithmic support that speeds up the process and reduces manual workload. Presently, most such supporting algorithms are data-driven and concentrate on specific features of the objects that are accessible to numerical models. By means of these models, which normally represent geometrical (flatness or roughness, for example) or physical features (color, texture), the data is classified and analyzed. This is successful for objects of low complexity but reaches its limits as object complexity increases. Purely numerical strategies are then unable to model reality sufficiently. Therefore, the intention of our approach is to take human cognitive strategy as an example and to simulate extraction processes based on available human-defined knowledge about the objects of interest. Such processes introduce a semantic structure for the objects and guide the algorithms used to detect and recognize objects, which will yield a higher effectiveness. Hence, our research proposes an approach using knowledge to guide the algorithms in 3D point cloud and image processing.